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Mini-Batch Semi-Stochastic Gradient Descent 

in the Proximal Setting 

Jakub Konecny, Jie Liu, Peter Richtarik, Martin Takac 


Abstract —We propose mS2GD: a method incorporating a 
mini-batching scheme for improving the theoretical complexity 
and practical performance of semi-stochastic gradient descent 
(S2GD). We consider the problem of minimizing a strongly 
convex function represented as the sum of an average of a large 
number of smooth convex functions, and a simple nonsmooth 
convex regularizer. Our method first performs a deterministic 
step (computation of the gradient of the objective function at 
the starting point), followed by a large number of stochastic 
steps. The process is repeated a few times with the last iterate 
becoming the new starting point. The novelty of our method is in 
introduction of mini-batching into the computation of stochastic 
steps. In each step, instead of choosing a single function, we 
sample b functions, compute their gradients, and compute the 
direction based on this. We analyze the complexity of the 
method and show that it benefits from two speedup effects. 
First, we prove that as long as 6 is below a certain threshold, 
we can reach any predefined accuracy with less overall work 
than without mini-batching. Second, our mini-batching scheme 
admits a simple parallel implementation, and hence is suitable 
for further acceleration by parallelization. 

Index Terms —mini-batches, proximal methods, empirical risk 
minimization, semi-stochastic gradient descent, sparse data, 
stochastic gradient descent, variance reduction. 

I. Introduction 

I N this work we are concerned with the problem of mini¬ 
mizing the sum of two convex functions, 

minfPfa;) := F(x) + R(x )}, (1) 

where the first component, F, is smooth, and the second 
component, R, is possibly nonsmooth (and extended real¬ 
valued, which allows for the modeling of constraints). 

In the last decade, an intensive amount of research was 
conducted into algorithms for solving problems of the form 
0. largely motivated by the realization that the underlying 
problem has a considerable modeling power. One of the 
most popular and practical methods for ([T} is the accelerated 
proximal gradient method of Nesterov m, with its most 
successful variant being FISTA 0. 

In many applications in optimization, signal processing and 
machine learning, F has an additional structure. In particular, 
it is often the case that F is the average of a number of convex 
functions: 

F ( x ) = ( 2 ) 

2=1 
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Indeed, even one of the most basic optimization problems— 
least squares regression—lends itself to a natural representa¬ 
tion of the form ([2]). 

A. Stochastic methods. 

For problems of the form ([TJh-Q, and especially when n 
is large and when a solution of low to medium accuracy is 
sufficient, deterministic methods do not perform as well as 
classical .vfoc/za.vfzVQ methods. The prototype method in this 
category is stochastic gradient descent (SGD), dating back 
to the 1951 seminal work of Robbins and Monro 0. SGD 
selects an index i g {1,2,... ,n} uniformly at random, and 
then updates the variable x using V/,(x) — a stochastic 
estimate of S7F(x). Note that the computation of V/j is n 
times cheaper than the computation of the full gradient V F. 
For problems where n is very large, the per-iteration savings 
can be extremely large, spanning several orders of magnitude. 

These savings do not come for free, however (modern 
methods, such as the one we propose, overcome this - more 
on that below). Indeed, the stochastic estimate of the gradient 
embodied by V/, has a non-vanishing variance. To see this, 
notice that even when started from an optimal solution x*, 
there is no reason for V/j(x*) to be zero, which means 
that SGD drives away from the optimal point. Traditionally, 
there have been two ways of dealing with this issue. The first 
one consists in choosing a decreasing sequence of stepsizes. 
However, this means that a much larger number of iterations is 
needed. A second approach is to use a subset (“minibatch”) of 
indices i, as opposed to a single index, in order to form a better 
stochastic estimate of the gradient. However, this results in a 
method which performs more work per iteration. In summary, 
while traditional approaches manage to decrease the variance 
in the stochastic estimate, this comes at a cost. 

B. Modern stochastic methods 

Very recently, starting with the SAG j4j, SDCA Q, SVRG 
|j6l and S2GD {7) algorithms from year 2013, it has tran¬ 
spired that neither decreasing stepsizes nor mini-batching are 

'Depending on conventions used in different communities, the terms 
randomized or sketching are used instead of the word stochastic. In signal 
processing, numerical linear algebra and theoretical computer science, for 
instance, the terms sketching and randomized are used more often. In machine 
learning and optimization, the terms stochastic and randomized are used 
more often. In this paper, stochasticity does not refer to a data generation 
process, but to randomization embedded in an algorithm which is applied to 
a deterministic problem. Having said that, the deterministic problem can and 
often does arise as a sample average approximation of stochastic problem 
(average replaces an expectation), which further blurs the lines between the 
terms. 
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necessary to resolve the non-vanishing variance issue inherent 
in the vanilla SGD methods. Instead, these modern stochastic^] 
method are able to dramatically improve upon SGD in various 
different ways, but without having to resort to the usual 
variance-reduction techniques (such as decreasing stepsizes 
or mini-batching) which carry with them considerable costs 
drastically reducing their power. Instead, these modern meth¬ 
ods were able to improve upon SGD without any unwelcome 
side effects. This development led to a revolution in the area 
of first order methods for solving problem (|T]i+0- Both the 
theoretical complexity and practical efficiency of these modern 
methods vastly outperform prior gradient-type methods. 

In order to achieve e-accuracy, that is, 

E [P(x k ) - P{x*)\ < e[P{x 0 ) - P(®*)], (3) 

modern stochastic methods such as SAG, SDCA, SVRG and 
S2GD require only 

0({n + K) log(l/e)) (4) 


units of work, where k is a condition number associated with 
F, and one unit of work corresponds to the computation of 
the gradient of /, for a random index i, followed by a call 
to a prox-mapping involving R. More specifically, n = L/n , 
where L is a uniform bound on the Lipschitz constants of the 
gradients of functions /* and ji is the strong convexity constant 


of P. These quantities will be defined precisely in Section IV 


The complexity bound ([4]) should be contrasted with that 
of proximal gradient descent (e.g., ISTA), which requires 
0(nn log(l/e)) units of work, or FISTA, which requires 
0(ji\fk log(l/e)) units of worI0 Note that while all these 
methods enjoy linear convergence rate, the modern stochastic 
methods can be many orders of magnitude faster than classical 
deterministic methods. Indeed, one can have 


n + k <C n\fn < m t. 


Based on this, we see that these modern methods always 
beat (proximal) gradient descent (n + n -C uk), and also 
outperform FISTA as long as n < 0(n 2 3 ). In machine learning, 
for instance, one usually has k ss n, in which case the 
improvement is by a factor of -Jn when compared to FISTA, 
and by a factor of n over ISTA. For applications where n is 
massive, these improvements are indeed dramatic. 

For more information about modern dual and primal meth¬ 
ods we refer the reader to the literature on randomized 
coordinate descent methods QD' 0 . in, cd, m, a, cd, 
03, 03, ED, H3, US and stochastic gradient methods a, 
ED, Eoj, ED, ED, ED, CD, ED, respectively. 


C. Linear systems and sketching. 

In the case when R = 0, all stationary points (i.e., points 
satisfying VF( x) = 0) are optimal for 0+0- In the special 

2 These methods are randomized algorithms. However, the term “stochastic” 
(somewhat incorrectly) appears in their names for historical reasons, and quite 
possibly due to their aspiration to improve upon stochastic gradient descent 
(SGD). 

3 However, it should be remarked that the condition number k in these latter 
methods is slightly different from that appearing in the bound El- 


case when the functions f, are convex quadratics of the form 
fi( x) = ^{afx — bi), the equation X7F(x) = 0 reduces to 
the linear system A T Ax = A T b, where A = [ai,...,n]. 
Recently, there has been considerable interest in designing and 
analyzing randomized methods for solving linear systems; also 
known under the name of sketching methods. Much of this 
work was done independently from the developments in (non¬ 
quadratic) optimization, despite the above connection between 
optimization and linear systems. A randomized version of the 
classical Kaczmarz method was studied in a seminal paper by 
Strohmer and Vershynin lf25l . Subsequently, the method was 
extended and improved upon in several ways ED, ED, US), 
(29ll . The randomized Kaczmarz method is equivalent to SGD 
with a specific stepsize choice (301, ED- The first randomized 
coordinate descent method, for linear systems, was analyzed 
by Lewis and Leventhal (32), and subsequently generalized 
in various ways by numerous authors (we refer the reader to 
DD and the references therein). Gower and Richtarik ED 
have recently studied randomized iterative methods for linear 
systems in a general sketch and project framework, which 
in special cases includes randomized Kaczmarz, randomized 
coordinate descent, Gaussian descent, randomized Newton, 
their block variants, variants with importance sampling, and 
also an infinite array of new specific methods. For approaches 
of a combinatorial flavour, specific to diagonally dominant 
systems, we refer to the influential work of Spielman and Teng 

ED¬ 


IT Contributions 

In this paper we equip moderns stochastic methods— 
methods which already enjoy the fast rate (|4]»—with the ability 
to process data in mini-batches. None of the prima^ modern 
methods have been analyzed in the mini-batch setting. This 
paper fills this gap in the literature. 

While we have argued above that the modern methods, 
S2GD included, do not have the “non-vanishing variance” 
issue that SGD does, and hence do not need mini-batching 
for that purpose, mini-batching is still useful. In particular, we 
develop and analyze the complexity of mS2GD (Algorithm |T|) 
— a mini-batch proximal variant of semi-stochastic gradient 
descent (S2GD) (D- While the S2GD method was analyzed 
in the R = 0 case only, we develop and analyze our method 
in the proximaj^] setting 0. We show that mS2GD enjoys 
several benefits when compared to previous modern methods. 
First, it trivially admits a parallel implementation, and hence 
enjoys a speedup in clocktime in an HPC environment. This is 
critical for applications with massive datasets and is the main 
motivation and advantage of our method. Second, our results 
show that in order to attain a specified accuracy e, mS2GD 
can get by with fewer gradient evaluations than S2GD. This 
is formalized in Theorem [2j which predicts more than linear 

4 By a primal method we refer to an algorithm which operates directly to 
solve Q+0 without explicitly operating on the dual problem. Dual methods 
have very recently been analyzed in the mini-batch setting. For a review of 
such methods we refer the reader to the paper describing the QUARTZ method 
ED and the references therein. 

5 Note that the Prox-SVRG method (35) can also handle the composite 
problem {7}. 
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speedup up to a certain threshold mini-batch size after which 
the complexity deteriorates. Third, compared to lf35l . our 
method does not need to average the iterates produced in each 
inner loop; we instead simply continue from the last one. This 
is the approach employed in S2GD (7J. 


in the description of our algorithm in Section [Hip. The outer 
loop ensures that the squared norm of G k .t approaches zero 
as k, t —> oo (it is easy to see that this is equivalent to saying 
that the stochastic estimate G ky t has a diminishing variance), 
which ultimately leads to extremely fast convergence. 


III. The Algorithm 

In this section we first briefly motivate the mathematical 
setup of deterministic and stochastic proximal gradient meth¬ 
ods in Section [TTT|\, followed by the introduction of semi¬ 
stochastic gradient descent in Section [HI)3. We will the be 


ready to describe the mS2GD method in Section III 


A. Deterministic and stochastic proximal gradient methods 

The classical deterministic proximal gradient approach ED, 
EH- EZl to solving Q is to form a sequence {y t } via 

Vt- i-i = arg min U t (x), 

x£R d 

where U t (x) = f F(y t ) + S7F(y t ) T (x - y t ) + jp\\x - y t || * 2 * * * 6 7 + 

R(x). Note that in view of Assumption [TJ which we shall use 
in our analysis in Section [TV] Ut is an upper bound on P 
whenever h > 0 is a stepsize parameter satisfying 1/h > L. 
This procedure can be compactly written using the proximal 
operator as follows; 

Vt+ 1 = P ro x hB.(.yt - hVF(y t )), 

where 

Pro x hR (z) = arg min{i||x - 2 || 2 + hR(x)}. 

cceR d 

In a large-scale setting it is more efficient to instead con¬ 
sider the stochastic proximal gradient approach, in which the 
proximal operator is applied to a stochastic gradient step: 

y t +i = pro x hR (y t - hG t ), (5) 

where G t is a stochastic estimate of the gradient V F(y t ). 


B. Semi-stochastic methods 

Of particular relevance to our work are the SVRG sa, 
S2GD [|7) and Prox-SVRG lf35l methods where the stochastic 
estimate of S7F(y t ) is of the form 

G t = VF(x) + - V/ it (aO), (6) 

where x is an “old” reference point for which the gradient 

def 

VF( x) was already computed in the past, and i t £ [n] = 
{1,2,..., tt.} is a random index equal to i with probability 
q, > 0. Notice that G t is an unbiased estimate of the gradient 
of F at y t \ 

n 

E it [Gt\ ^ \7F(x) + Y / <h^(VMy t )-\7f i (x)) ! X7F(y t ). 

i —1 

Methods such as S2GD, SVRG, and Prox-SVRG update 
the points y t in an inner loop, and the reference point x in 
an outer loop (“epoch”) indexed by k. With this new outer 
iteration counter we will have Xk instead of x, y ky t instead of 
y t and Gk.t instead of G t . This is the notation we will use 


C. Mini-batch S2GD 

We are now ready to describe the mS2GD methotQ (Algo¬ 
rithm Q. 

Algorithm 1 mS2GD 

1 : Input: to (max # of stochastic steps per epoch); h > 0 
(stepsize); Xq £ (starting point); mini-batch size b £ 

[n] 

2 : for k = 0,1,2,... do 

3: Compute and store g k £- S7F(x k ) = £ J2 t ^fi( x k) 

4: Initialize the inner loop: y k ,o <— x k 

5: Choose tk £ {1,2,..., m} uniformly at random 

6: for t = 0 to t k — 1 do 

7: Choose mini-batch A k t Q [n] of size b, 

uniformly at random 

B: Compute a stochastic estimate of X7F(yk y t)'- 

Gk,t £“ gk + 5 — ^ fi{ x k)) 

9 : y k ,t+i WOx hR (y ky t hG ky t ) 

10: end for 

11: Set i Vk,tk 

12: end for 


The algorithm includes an outer loop, indexed by epoch 
counter k, and an inner loop, indexed by t. Each epoch is 
started by computing g k , which is the (full) gradient of F at 
x k . It then immediately proceeds to the inner loop. The inner 
loop is run for t k iterations, where t k is chosen uniformly at 
random from {1,... ,to}. Subsequently, we run t k iterations 
in the inner loop (corresponding to Steps 6-10). Each new 
iterate is given by the proximal update {5 i, however with the 
stochastic estimate of the gradient G k .t in (6), which is formed 
by using a mini-batch of examples A k t C [n] of size \A k t\ = b. 
Each inner iteration requires 2b units of worlQ 

IV. Analysis 

In this section, we lay down the assumptions, state our main 
complexity result, and comment on how to optimally choose 
the parameters of the method. 


A. Assumptions 

Our analysis is performed under the following two assump¬ 
tions. 

6 A more detailed algorithm and the associated analysis (in which we benefit 
from the knowledge of lower-bound on the strong convexity parameters of 
the functions F and R) can be found in the arXiv preprint I2D. The more 
general algorithm mainly differs in being chosen according to a geometric 
probability law which depends on the estimates of the convexity constants. 

7 It is possible to finish each iteration with only b evaluations for com¬ 
ponent gradients, namely {V fi(yk,t)}ieA kt , at the cost of having to store 
{V/i(iCfc)}ig[ n ], which is exactly the way that SAG (4) works. This speeds 
up the algorithm; nevertheless, it is impractical for big n. 
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Assumption 1. Function R : R d — > K U {+ 00 } (regu- 
larizer/proximal term) is convex and closed. The functions 
fi : t R. have Lipschitz continuous gradients with 

constant L > 0. That is, ||V/j(ic) — || < L\\x — y ||, 

for all x,y £ R d , where || • || is the i^-norm. 

Hence, the gradient of F is also Lipschitz continuous with 
the same constant L. 

Assumption 2. P is strongly convex with parameter // > 0. 
That is for all x,y £ dom(f?) and £ £ dP(x), 

P{y) > P{x) + i r (y-x) + ± || y - if, (7) 
where dP(x) is the subdifferential of P at x. 

Lastly, by pp > 0 and pr > 0 we denote the strong 
convexity constants of F and R, respectively. We allow both 
of these quantities to be equal to 0, which simply means that 
the functions are convex (which we already assumed above). 
Hence, this is not necessarily an additional assumption. 


B. Main result 

We are now ready to formulate our complexity result. 

Theorem 1. Let Assumptions [ 7 ] and [ 7 ] be satisfied, let 2 , = 
argmin^ P{x) and choose b £ {1,2..., n}. Assume that 0 < 
h < 1/L, 4hLa(b) < 1 and that m,h are further chosen so 
that 

n d M _I__i_ - 1 

r mhfi(l—4hLct(b )) ' m(l— 4hLa(b)) ’ ^ ' 

where a{b) = Then mS2GD has linear convergence 

in expectation with rate p: 

E [P(x k ) - P(x.)] < P k [P(x 0 ) - P(®*)]. 

Notice that for any fixed b, by properly adjusting the 
parameters h and to we can force p to be arbitrarily small. 
Indeed, the second term can be made arbitrarily small by 
choosing h small enough. Fixing the resulting h, the first 
term can then be made arbitrarily small by choosing to large 
enough. This may look surprising, since this means that only 
a single outer loop (k = 1 ) is needed in order to obtain a 
solution of any prescribed accuracy. While this is indeed the 
case, such a choice of the parameters of the method (to, h, 
k ) would not be optimal - the resulting workload would to 
be too high as the complexity of the method would depend 
sublinearly on e. In order to obtain a logarithmic dependence 
on 1 /e, i.e., in order to obtain linear convergence, one needs to 
perform k = 0 (log(l/e)) outer loops, and set the parameters 
h and m to appropriate values (generally, h = 0{1/L) and 
to = O(k)). 

C. Special cases: b = 1 and b = n 

In the special case with b = 1 (no mini-batching), we get 
a{b) = 1 , and the rate given by ([ 8 ]» exactly recovers the rate 
achieved by Prox-SVRG lf35l (in the case when the Lipschitz 
constants of V/,; are all equal). The rate is also identical to the 
rate of S2GD Q (in the case of R = 0, since S2GD was only 
analyzed in that case). If we set the number of outer iterations 


to k = |4og(l/e)~|, choose the stepsize as h = ( 2 + 4 e )L ' w h ere 
e = exp(l), and choose to = 43k, then the total workload of 
mS2GD for achieving 0 is (n + 43/t) log(l/e) units of work. 
Note that this recovers the fast rate ([4]>. 

In the batch setting, that is when b = n, we have a(b) = 0 
and hence p = l/(mhp). By choosing k = |4og(l/e)], h = 
1/L , and m = 2k, we obtain the rate O (jik log(l/e)). This 
is the standard rate of (proximal) gradient descent. 

Hence, by modifying the mini-batch size b in mS2GD, we 
interpolate between the fast rate of S2GD and the slow rate 
of GD. 


D. Mini-batch speedup 

In this section we will derive formulas for good choices of 
the parameter to, h and k of our method as a function of b. 
Hence, throughout this section we shall consider b fixed. 

Fixing 0 < p < 1, it is easy to see that in order for xt 
to be an e-accurate solution (i.e., in order for ([3| to hold), it 
suffices to choose k > (1 — p ) _1 log(e _1 ). Notice that the 
total workload mS2GD will do in order to arrive at x.p, is 

k(n + 2 to) « (1 — p)^ 1 log(e _1 )(n + 2 to) 

units of work. If we now consider p fixed (we may try 
to optimize for it later), then clearly the total workload is 
proportional to to. The free parameters of the method are the 
stepsize h and the inner loop size to. Hence, in order to set the 
parameters so as to minimize the workload (i.e., optimize the 
complexity bound), we would like to (approximately) solve 
the optimization problem 

min to subject to 0 < h < h < 4 Z J^ , p is fixed. 

Let denote the optimal pair (we highlight the 

dependence on b as it will be useful). Note that if to, < m\/b 
for some b > 1 , then mini-batching can help us reach the e- 
solution with smaller overall workload. The following theorem 
presents the formulas for hf and m\. 

Theorem 2. Fix b and 0 < p < 1 and let 

h b = 1 ( t+p / 2 1 1 _ i+p 

Y V PP / 4 pia{b)L ppi 

If h b < j-, then h b = h b and 

m * = t | i 1 + )) M & ) + \/ i£ 7r i + ( 1 + ^) [M fo )] 2 j . 

(9) 


where k = j/ is the condition number. Ifh b > j-, then h b 
and 

b _ w+4 a(b) 

* p— 4ck(6)(1+p) * 


L 


( 10 ) 


Note that if b = 1, then 

Equation {9]i suggests that as long as the condition h b < L 
holds, m b is decreasing at a rate faster than l/b. Hence, we 
can find the solution with less overall work when using a 
minibatch of size b than when using a minibatch of size 1 . 
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E. Convergence rate 

In this section we study the total workload of mS2GD in 
the regime of small mini-batch sizes. 

Corollary 3. Fix e £ (0,1), choose the number of outer 
iterations equal to 


k = flog(l/e)"| , 

and fix the target decrease in Theorem [2] to satisfy p = e 1 / fe . 
Further, pick a mini-batch size satisfying 1 < b < 29, let the 
stepsize h be as in and let m be as in Then in order 
for mS2GD to find Xk satisfying © mS2GD needs at most 

(n + 2bm b )\\og(l/e)~\ (11) 


units of work, where bm b = O(k), which leads to the overall 
complexity of 

O ((ra + k) log(l/e)) 


units of work. 

Proof. Available in Appendix |B-D 


Q 


This result shows that as long as the mini-batch size is small 
enough, the total work performed by mS2GD is the same as in 
the 6=1 case. If the b updates can be performed in parallel, 
then this leads to linear speedup. 


F. Comparison with Acc-Prox-SVRG 

The Acc-Prox-SVRG l23l method of Nitanda, which was 
not available online before the first version of this paper 
appeared on arXiv, incorporates both a mini-batch scheme and 
Nesterov’s acceleration |f39l , JT]. The author claims that when 
b < [fen], with the threshold b Q defined as s^/lZn —the 

11 v2p(n— l)+8y/K 

overall complexity of the method is 

0 ((n+^«) log(l/e)) ; 
and otherwise it is 

O {{n + byfn) log(l/e)) . 

This suggests that acceleration will only be realized when the 
mini-batch size is large, while for small b, Acc-Prox-SVRG 
achieves the same overall complexity, O ((n + k) log(l/e)), 
as mS2GD. 

We will now take a closer look at the theoretical results 
given by Acc-Prox-SVRG and mS2GD, for each e £ (0,1). 
In particular, we shall numerically minimize the total work of 
mS2GD, i.e., 

(n + 2b\m b ]) [log(l/e)/log(l/p)l , 

over p £ (0,1) and h (compare this with (fTT|); and compare 
these results with similar fine-tuned quantities for Acc-Prox- 
SVRG0 

Fig. □ illustrates these theoretical complexity bounds for 
both ill-conditioned and well-conditioned data. With small- 
enough mini-batch size b, mS2GD is better than Acc-Prox- 
SVRG. However, for a large mini-batch size b, the situation 


reverses because of the acceleration inherent in Acc-Prox- 
SVRG0Plots with b = 64 illustrate the cases where we cannot 
observe any differences between the methods. 

Note however that accelerated methods are very prone to 
error accumulation. Moreover, it is not clear that an efficient 
implementation of Acc-Prox-SVRG is possible for sparse data. 
As shall show in the next section, mS2GD allows for such an 
implementation. 


V. Efficient implementation for sparse data 

Let us make the following assumption about the structure 
of functions /,; in ([2]). 

Assumption 3. The functions f - t arise as the composition of 
a univariate smooth function </>, and an inner product with a 
datapoint/example a; £ ffx) = (f>(af x) for i = 1 ,... ,n. 


Many functions of common practical interest satisfy this 
assumption including linear and logistic regression. Very often, 
especially for large scale datasets, the data are extremely 
sparse, i.e. the vectors { a ,} contains many zeros. Let us denote 
the number of non-zero coordinates of a,i by tu* = ||a;||o < d 
and the set of indexes corresponding to non-zero coordinates 
by supported,;) = {j : aj 7^ 0}, where aj denotes the j th 
coordinate of vector a,. 

Assumption 4. The regularization function R is separable. 

This includes the most commonly used regularization func¬ 
tions as f ||a;|| 2 or A||x||i. 

Let us take a brief detour and look at the classical SGD 
algorithm with R = 0. The update would be of the form 

Xj + 1 «— Xj — hfi'fiajXj)a z = Xj - hS7ffixj). (12) 

If evaluation of the univariate function (p' t takes 0(1) amount 
of work, the computation of V/, will account for work. 

Then the update ( fl2] i would cost O(oj,) too, which implies that 
the classical SGD method can naturally benefit from sparsity 
of data. 

Now, let us get back to the Algorithm |T] Even under the 
sparsity assumption and structural Assumption [3] the Algo¬ 
rithm [T] suggests that each inner iteration will cost 0(ui + d) ~ 
0{d) because <p- is in general fully dense and hence in Step 
[9] of Algorithm [T] we have to update all d coordinates. 

However, in this Section, we will introduce and describe 
the implementation trick which is based on “lazy/delayed” 
updates. The main idea of this trick is not to perform Step 
[9] of Algorithm [T] for all coordinates, but only for coordinates 
j £ UigA fct support (ai). The algorithm is described in Algo¬ 
rithm [2] 


H rn h is the best choice of m for Acc-Prox-SVRG and mS2GD, respectively. 
Meanwhile, h is within the safe upper bounds for both methods. 


9 We have experimented with different values for n, b and k,, and this result 
always holds. 
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Fig. 1: Complexity of Acc-Prox-SVRG and mS2GD in terms of total work done for n = 10,000, and 
small (re = \/v\ top row) and large (re = n 2 ; bottom row) condition number. 


Algorithm 2 ’’Lazy” updates for mS2GD 
(these replace steps 6-10 in Algorithm |T]> 

l: x (i) 0 for j = 1,2, ...,d 
2 : for t = 0 to tk — 1 do 
3: Choose mini-batch A k t Q [n] of size b, 

uniformly at random 
4: for i G A k t do 

5: for j G support(ai) do 

6 : !/*,* <- prox*-^ , gi , R : ft] 

7: <- t 

8: end for 

9: end for 

10: Uk,t+1 Uk,t - | Ei G A fct a-M'iiVk^i) ~ ^ii x l a i)) 

11 : end for 

12 : for j = 1 to d do 

13: vl,t k <“ prox**"^ [y^ 3 - ,g : k ,R, h\ 

14: end for 


To explain the main idea behind the lazy/delayed updates, 
consider that it happened that during the fist r iterations, the 
value of the fist coordinate in all datapoints which we have 
used was 0. Then given the values of yj 0 and gl we can 
compute the true value of y\ t easily. We just need to apply 
the prox operator r times, i.e. y\ T = prox[[yfc j0 ) <7fc> R, ft], 
where the function prox[ is described in Algorithm [3] 

The vector \ in Algorithm [2] is enabling us to keep track of 
the iteration when corresponding coordinate of y was updated 
for the last time. E.g. if in iteration t we will be updating the 

1 st coordinate for the first time, y * 1 * = 0 and after we compute 

and update the true value of y 1 , its value will be set to y 1 = t. 

Lines 5][8 in Algorithm [2] make sure that the coordinates of 
y k>t which will be read and used afterwards are up-to-date. 


Algorithm 3 proxj \y, g, R, h] 

yo = y 

for s = 1,2,... ,r do 

y s <— prox^ H (y s _i - hg) 

end for 
return y° T 


At the end of the inner loop, we will updates all coordinates 
of y to the most recent value (lines [T2l[T4| . Therefore, those 
lines make sure that the yk,t k of Algorithms [l] and [2] will be 
the same. 

However, one could claim that we are not saving any 
work, as when needed, we still have to compute the prox¬ 
imal operator many times. Although this can be true for a 
general function R , for particular cases, R(x) = 4ll :r l| 2 anc l 
R(x) = A||x||f, we provide following Lemmas which give a 
closed form expressions for the proxj operator. 

Lemma 1 (Proximal Lazy Updates with (a-Regularizer). If 
R(x) = ^ || x || 2 with A > 0 then 

proxj[y, g, R, h} = /3 T y J - ^ (1 - /3 T ) g 3 , 
where /? = 1/(1 + A ft). 

Lemma 2 (Proximal Lazy Updates with £| -Regularizer). 
Assume that R(x) = A||x||i with A > 0. Let us define M 
and m as follows, 

M = [A + g 3 ]h, to = — [A — gi]h , 

dcf 

and let [-] + = max{-,0}. Then the value of proxj[y, g, R, ft] 
can be expressed based on one of the 3 situations described 
below: 
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Fig. 2: Comparison of mS2GD with different mini-batch sizes on rcvl (left) and astro-ph (right). 




Fig. 31 Parallelism speedup for rcvl (left) and astro-ph (right) in theory (unachievable in practice). 


def 

1) If g J > A, then by letting p = 
defined as 

proxj [y. g. R, h] 

y J — tM, ifp>r , 

min{?/ J - \p]+M, to} - (r - [p\+)m, if p < r. 



I, f/ze operator can be 


2) If — X < g 3 < A, f/zezz f/ze operator can be defined as 


proXj[y,g,R,h\ 


max{t/ J 


- tM, 0}, 

— rm, 0}, 


if V 3 > 0, 

< 0 . 


dcf 

3) If g J < —A, then by letting q = 
be defined as 


yf I 

m ’ 


— , the operator can 


proxJ[y, g, R,h] 

IV - rm, if q>r, 

jmax}!/ 3 - [ q]+m,M} - (r - [q]+)M, if q < r. 

The proofs of Lemmas [I] and [2] are available in AP¬ 
PENDIX □ 

Remark: Upon completion of the paper, we learned that 
similar ideas of lazy updates were proposed in EBB and 
El for online learning and multinomial logistic regression, 
respectively. However, our method can be seen as a more 
general result applied to a stochastic gradient method and its 
variants under Assumptions [3] and [4] 


VI. Experiments 

In this section we perform numerical experiments to il¬ 
lustrate the properties and performance of our algorithm. In 



loss function: 

fi(x) = log[l + exp(—a;)]. (13) 

These functions are often used in machine learning, with 
(di,bi) G R d x {+1,-1}, i = 1 being a training 

dataset of example-label pairs. The resulting optimization 
problem (|T]h-([2]) takes the form 

P(x) = ^El 1 Mx) + lM 2 , ( 14 ) 

and is used in machine learning for binary classification. In 
these sections we have performed experiments on four publicly 
available binary classification datasets, namely rcvl, news20, 
covtype [^] and astro-ph [^] 

In the logistic regression problem, the Lipschitz constant of 
function V/i is equal to Li = |||| 2 /4. Our analysis assumes 

>0 rcvJ, covtype and news20 are available at http://www.csie.ntu.edu.tw/ 
~cjlin/libsvmtools/datasets/ 

"Available at http://users.cecs.anu.edu.au/~xzhang/data/ 
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Fig. 4: Comparison of several algorithms on four datasets: rcvl (top left), news20 (top right), covtype 
(bottom left) and astro-ph (bottom right). We have used mS2GD with b = 8. 


(Assumption [T} the same constant L for all functions. Hence, 
we have L = max^y L, . We set the regularization parameter 
A = i in our experiments, resulting in the problem having the 
condition number k = = 0{n). In Table 0 we summarize 

the four datasets, including the sizes n, dimensions d, their 
sparsity levels as a proportion of nonzero elements, and the 
Lipschitz constants L. 


Dataset 

n 

d 

Sparsity 

L 

rcvl 

20,242 

47,236 

0.1568% 

0.2500 

news20 

19,996 

1,355,191 

0.0336% 

0.2500 

covtype 

581,012 

54 

22.1212% 

1.9040 

astro-ph 

62,369 

99,757 

0.0767% 

0.2500 


TABLE I: Summary of datasets used for experiments. 


A. Speedup of mS2GD 

Mini-batches allow mS2GD to be accelerated on a computer 
with a parallel processor. In Section IV-D we have shown in 
that up to some threshold mini-batch size, the total workload 
of mS2GD remains unchanged. Figure [2] compares the best 
performance of mS2GD used with various mini-batch sizes 
on datasets rcvl and astro-ph. An effective pass (through the 
data) corresponds to n units of work. Hence, the evaluation of 
a gradient of F counts as one effective pass. In both cases, by 
increasing the mini-batch size to b = 2,4,8, the performance 
of mS2GD is the same or better than that of S2GD (b = 1) 
without any parallelism. 

Although for larger mini-batch sizes mS2GD would be ob¬ 
viously worse, the results are still promising with parallelism. 
In Figure [3jwe show the ideal speedup—one that would be 


achievable if we could always evaluate the b gradients in 
parallel in exactly the same amount of time as it would take 
to evaluate a single gradient^] 

B. mS2GD vs other algorithms 

In this part, we implemented the following algorithms to 
conduct a numerical comparison: 

1) SGDcon: Proximal stochastic gradient descent method 
with a constant step-size which gave the best performance in 
hindsight. 

2) SGD+: Proximal stochastic gradient descent with variable 
step-size h = ho/(k + 1 ), where k is the number of effective 
passes, and ho is some initial constant step-size. 

3) FISTA: Fast iterative shrinkage-thresholding algorithm pro¬ 
posed in El- 

4) SAG: Proximal version of the stochastic average gradient 
algorithm ED. Instead of using h = 1/1 6L, which is analyzed 
in the reference, we used a constant step size. 

5) S2GD: Semi-stochastic gradient descent method proposed 
in 0 . We applied proximal setting to the algorithm and used 
a constant stepsize. 

6 ) mS2GD: mS2GD with mini-batch size 6 = 8 . Although a 
safe step-size is given in our theoretical analyses in Theorem[I] 
we ignored the bound, and used a constant step size. 

In all cases, unless otherwise stated, we have used the best 
constant stepsizes in hindsight. 

Figure [^demonstrates the superiority of mS2GD over other 
algorithms in the test pool on the four datasets described 

12 In practice, it is impossible to ensure that the times of evaluating different 
component gradients are the same. 
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Fig. 5: Original (left) 
and blurred & noisy 
(right) test image. 


above. For mS2GD, the best choices of parameters with 6 = 8 
are given in Table [IT] 


Parameter 

rcvl 

news20 

covtype 

astro-ph 

m 

0.1 In 

O.lOn 

O.Oln 

0.08?r 

h 

5.5 / L 

6 /L 

350 /L 

6.5 /L 


TABLE II: Best choices of parameters in mS2GD. 


C. Image deblurring 

In this section we utilize the Regularization Toolbox [42l p] 
We use the blur function available therein to obtain the original 
image and generate a blurred image (we choose following 
values of parameters for blur function: N = 256, band=9, 
sigma=10). The purpose of the blur function is to generate a 
test problem with an atmospheric turbulence blur. In addition, 
an additive Gaussian white noise with stand deviation of Hr 3 
is added to the blurred image. This forms our testing image as 
a vector 6. The image dimension of the test image is 256 x 256, 
which means that n = d = 65,536. We would not expect 
our method to work particularly well on this problem since 
mS2GD works best when d « n. However, as we shall see, 
the method’s performance is on a par with the performance of 
the best methods in our test pool. 

Our goal is to reconstruct (de-blur) the original image x 
by solving a LASSO problem: min x || Ax — 6 ||| + A||a;||i. We 
have chosen A = 10 -4 . In our implementation, we normalized 
the objective function by n, and hence our objective value 
being optimized is in fact min x Aj|Ax — 6||| + A|jcc|ji, where 
A = ——, similarly as was done in [2]. 

Figure <[5j shows the original test image (left) and a blurred 
image with added Gaussian noise (right). Figure [ 6 ] compares 
the mS2GD algorithm with SGD+, S2GD and FISTA. We 
run all algorithms for 100 epochs and plot the error. The 
plot suggests that SGD+ decreases the objective function very 
rapidly at beginning, but slows down after 10-20 epochs. 

Finally, Fig. [7] shows the reconstructed image after T = 
20,60,100 epochs. 



Fig. 6: Comparison of several algorithms for the de-blurring problem. 


inverse problems in signal processing and statistics. Simi¬ 
larly to SAG, SVRG, SDCA and S2GD, our algorithm also 
outperforms existing deterministic method such as ISTA and 
FISTA. Moreover, we have shown that the method is by design 
amenable to a simple parallel implementation. Comparisons 
to state-of-the-art algorithms suggest that mS2GD, with a 
small-enough mini-batch size, is competitive in theory and 
faster in practice than other competing methods even without 
parallelism. The method can be efficiently implemented for 
sparse data sets. 


Appendix A 
Technical Results 

Lemma 3 (Lemma 3.6 in [35]). Let R be a closed convex 
function on and x,y £ dom(f?), then || prox fl (a:) — 
P rox fl(z/)ll < Ik - 2/|| • 

Note that contractiveness of the proximal operator is a 
standard result in optimization literature ED, ED. 

Lemma 4. Let {£.,:} " = i be vectors in and £ = A ]P ” =1 £ 

R d . Let S be a random subset of [n] of size r, chosen 
uniformly at random from all subsets of this cardinality. Taking 
expectation with respect to S, we have 

<^(^i)E:uii£ii 2 - (is) 


E 


| a y\ 

I r ^—'2 


es 




Following from the proof of Corollary 3.5 in [35], by 
applying Lemma|4]with & := V fi(yk,t) - V fi{x k ), we have 
the bound for variance as follows. 


VII. Conclusion 

We have proposed mS2GD—a mini-batch semi-stochastic 
gradient method—for minimizing a strongly convex compos¬ 
ite function. Such optimization problems arise frequently in 

1 -’Regularization Toolbox available for Matlab can be obtained from http: 
//www.imm.dtu.dk/~pcha/Regutools/ . 


Theorem 4 (Bounding Variance). Let a(b) = , ■ Con¬ 

sidering the definition of G k ,t in Algorithm [7] conditioned on 
Pk,t, we have E[Gfc,t] = \7F(y k ,t) and the variance satisfies, 

E [||G M -VF(y M )|| 2 ] 

< 4La(b)[P(y ktt ) - P(x*) + P{x k ) - P(x*)\. (16) 
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FISTA SGD+ 


S2GD 


mS2GD (b = 4) mS2GD (6 = 8) 


T = 20 


T = 60 


T = 100 


Fig. 7: Reconstruction of the test image from Figure [ 5 ] via FISTA, SGD+, S2GD and mS2GD after 
T = 20, 60,100 epochs (one epoch corresponds to work equivalent to the computation of one gradient.) 



Appendix B 
Proofs 

A. Proof of Lemma [7] 

As in the statement of the lemma, by E[-] we denote 
expectation with respect to the random set S. First, note that 


def„ 

7) = E 


l?E i 6 s&-£ 


— 112 


= E 


ZteM -ikii 


= 4e 




Applying Lemma 3.7 in (35l (this is why we need to assume 
that h<l/L) with x = y k ,t, v = G k ,t, x + = y k ,t+i, 9 = 4,t> 
y = x* and A = A k>t = G k ,t - VP(y fc , t ), we get 

~ d k,t{Vk,t - x*) + f 114,t|| 2 < P(x*) - P(y k ,t+ 1 ) 

-^1 \Vk,t ~ x *|| 2 - ^-|bfc,t+i - x*|| 2 - A ktt (y k ,t +1 - x*), 

(18) 

and therefore. 


If we let C = f ||£|| 2 = ^ (E,;j we can thus write 

v = h E,« 1 + Z E”„ «, T e<) - o 

=*($2nE, J e, T & + (; 


n(n— 1) 


EtiZUi -c 


1 

TIT 


- (-M + Z) E„ ff(, + S E”., Si, 

1 n—T 


(n- 1) 


e: :e,,s c ;% 


< —r 

— nr (? 


EfyELilN 


where in the last step we have used the bound 1 E ? ; 3 = 

n E'-,^ 2 > °- 


„ {T 2 HID 

\\yk,t+i - x*|| s 2 h (P(x*) - P(t/fc,t+i) 

~ A fc,t(t/M +1 - a:*)) + lls/fc,t - x*|| 2 

— ||j/fc,t 111*11 2 dA kAyk,t-\-l X*) 

-2h[P(y k , t+1 ) - P(x.)\. (19) 

2 In order to bound — A x t (y ktt +i ~ x*), let us define the 
proximal full gradient update a^] y k;t +1 = prox^ fl (y M - 
h'VF(y ktt ))- We get 


B. Proof of Theorem [7] 

The proof is following the steps in |[35l . For convenience, 
let us define the stochastic gradient mapping 

4,t = \{Vk,t - y k ,t+ 1 ) = %{Vk,t - pro x hAVk,t - hG kt t)), 

then the iterate update can be written as y k ,t+ 1 = — hdk } t- 

Let us estimate the change of ||z/fe,t+i — x*||. It holds that 

||2/fc,t+l X*|| — ||t/)c,i X* 11 

= || Vk,t - x*|| 2 - 2hd k>t (y k ,t-i ~ x *) + ^ 2 ||4,t|| 2 - (17) 


A M (t/ M +i x *) 

= - A lAVk,t +1 - 2/k,t+l) - A lAVk,t+l - X *) 

= - A fc.t(4,t+1 - X*) - A k,t [P Iox hR{yk,t - hG ki t) 

- P vox h R{yk,t-i ~ hVF(y k ' t - 1 ))] 

Using Cauchy-Schwarz and Lemma [3] we conclude that 

14 Note that this quantity is never computed during the algorithm. We can 
use it in the analysis nevertheless. 
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- A fc,t( 2 /fc ,*+1 - **) 

< ||A fc>t ||||( 2 /fe >t - hG k ,t) - ( yk,t ~ hVF(y k ,t ))|| 
- A I,t(yk,t+i-x*), 

= h\\ A M || 2 - Afe )t (j/ fc)t+ i - x*). (20) 

(20|JT9} 2 

Further, we obtain ||y fc ,t+i - **|| 2 Ss I \Vk,t ~ **11" + 

2h (/i||A fcjt || 2 - Al )t (y k ,t+i - **) - [P{y k ,t+ 1 ) - P(**)]) • 

By taking expectation, conditioned on t/fc,j^]we obtain 

|20}JT9} 2 

E[||t/fc,t+i ^ **|| ] S ||t/fc,t - **|| 

+2 h (h E[|| A*,* || 2 ] - E[P(y M+1 ) - P(x.)]), (21) 

where we have used that E[Afe it ] = E \G k ,t\ — VF(y k ,t) = 0 
and hence E[—A ]\ t {y k t+i — x*)] = (f®l Now, if we substitute 
m into @T} and decrease index t by 1 , we obtain 

(20| JT9) 2 

E[|| 2 /fe,t — x*|| ] S \\yk,t-i ~ X*\\ 

+6[P{yk,t- 1 ) - p(**) + P(*fc) - P(**)] 

—2h’E[P(y k ,t) — P(x*)], ( 22 ) 

where 

9 = 8 Lh 2 a(b) (23) 

and a (b) = . Note that |22| is equivalent to 

E[|| y k , t - x*|| 2 ] + 2/i(E[P(y M ) - P(x*)]) 

< ||y fc) t-i - x *|| 2 

+0 (P(y M -i) - P(**) + P(xfc) - P(x*)). (24) 

Now, by the definition of x k in Algorithm |T| we have that 

m 

E[P(x Jfe+1 )] = A^ E [P(y M )]. (25) 

t= 1 

By summing ( |24| ) for 1 < t < ra, we get on the left hand side 

m 

LHS = Y E[|| y k ,t - ** II 2 ] + 2 h E[P(y M ) - P(x.)] (26) 

t=l 

and for the right hand side we have: 

m 

RHS = Y{v\\yk,t-i-x4 2 + 

t =1 

9E[P(y k ,t-i) - P(x*) + P(xfc) - P(x.)]} 

m— 1 m 

< Yj e IImm - **ll 2 + eY E[P(y*, t ) - P(*.)| 

£=0 t =0 

+0E[P(z fc )-P(:r*)]m. (27) 

15 For simplicity, we omit the E[* | t] notation in further analysis 
16 yk t+i is constant, conditioned on yk t 


Combining |26| and © and using the fact that LF[S < 
RF[S, we have 

m 

E[||t/fc, m - x*|| 2 ] + 2h 'Y J E[P(j/fe,t) - P(x*)] 

i=l 

< E ||t/fc,o - **|| 2 + 6»E[P(x fe ) - P(x*)]m 

m 

+eYnP(y k ,t) - p(*.)] + 6E[P(yk, 0 ) - P(*.)]. 

t=l 

Now, using ( |25] >, we obtain 

E[||y fejm - x*|| 2 ] + 2hmE[P(x k+ i) - P(x*)] 

< E ||t/fc,o - **|| 2 + 0mE[P(x fe ) - P(x*)] 

+0m E[P(xfc +1 ) - P(x*)] + 0E[P(y fc , o ) - P(x*)]. (28) 

Strong convexity (|7]i and optimality of x* imply that 0 € 
9P(x*), and hence for all x € we have 

||*^**|| 2 < §[P(x) - P(x*)]. (29) 

Since E \\y k , m — x*|| 2 > 0 and y k ,o = x k , by combining ( |29| ) 
and ( [28] ) we get 

m(2h — 9) E[P(xfc+i) — P(x*)] 

< (P(*fc) ^ P(x*)) (J + 0 (m + 1)) . 

Notice that in view of our assumption on h and ( |23j ), we have 
2 h > 9, and hence 

E[P(xfe+i) - P(x*)] < P [P(x k ) - P(x*)], 

where P = m^ 2 h~e) + Sh-e) • Applying the above linear 
convergence relation recursively with chained expectations, we 
finally obtain E[P(x*;) — P(x*)] < p k [P(x 0 ) — P(x*)]. 


C. Proof of Theorem [2] 

Clearly, if we choose some value of h then the value of m 
will be determined from f} (i.e. we need to choose m such 
that we will get desired rate). Therefore, m as a function of 
h obtained from | 8 ]i is 

m(h) = - 1+4a(b)h2z ^ 


hy(p— 4o;(b)h.L(p+l)) * 


(30) 


Now, we can observe that the nominator is always positive and 
the denominator is positive only if p > 4a(b)hL(p+l ), which 
im P lies 4 SfS)L ■ f+T > h ( note that ~p+i e [0, |]). Observe 
that this condition is stronger than the one in the assumption 
of Theorem [I] It is easy to verify that 


lim m(h) = +oo, 
h \o 


lim m(h) = +oo. 

h * 1 p 

4a(b)L p+1 


Also note that m(h) is differentiable (and continuous) at any 
h e (0, Yb) L ' f+l> =: Ih ' The derivative of m is given 
by m'(h) = W P) ■ Observe that m'(h) is 

defined and continuous for any h £ Ih- Therefore there have 
to be some stationary points (and in case that there is just on 
Ih) it will be the global minimum on Ih. The FOC gives 


7 b _ -2a(b)L(l+p) + - x /a(b)L(^p 2 +4a(fc)L(l+p)^) 
2 a(b)Lyp 


1 , (1+P) 2 

Aa.{b)Lp ' p 2 p 2 


1+P 

PP 


( 31 ) 
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If this h b £ Ih and also h b < j- then this is the optimal choice 
and plugging ( |3T| into © gives us a 

a) Claim #1: It always holds that h b £ Ih- We just need 
to verify that 

/ t _I (t +p)~ _ t+p ^ i . p 

y 4a(b)Lfi ' f l 2 p 2 /ip ^ 4 a(b)L p+1 5 

which is equivalent to pp 2 + 4a(b)L(l + p ) 2 > 2(1 + 
p)^/a(b)L(pp 2 + 4a(b)L(l + p) 2 ). Because both sides are 
positive, we can square them to obtain the equivalent condition 

pp 2 (pp 2 + 4a(b)L(l + p) 2 ) > 0. 

b) Claim #2: If h b > then h b = The only detail 
which needs to be verified is that the denominator of ( flQ| ) is 
positive (or equivalently we want to show that p > 4a(h)(l + 
p). To see that, we need to realize that in that case we have 
Z<~h b < jJgrx • ~Zy \» which implies that 4a(b)(l + p) < p. 


D. Proof of Corollary [77] 

By substituting definition of h b in Theorem [ 2 ] we get 


h b <i 


j , j def 8pnn+8nK+4pn 
u ^ u 0 pnn+(7p+8)K+4p’ 


where k = L/p. Hence, it follows that if b < |"6o~|> then 
h b = h b and m b is defined in ([9]>; otherwise, h b = j- and m b 
is defined in ( |T()| ). Let e be the base of the natural logarithm. 
By selecting bo = ^”At 7 + 8 e)tt+ 4 , choosing mini-batch size 
b < \bd ], and running the inner loop of mS2GD for 


m b 


8ea{b)n (e + 1 + yj 4 ^ + (1 + ej^ 


iterations with constant stepsize 


(33) 


hh ='j {¥) 2+ <*> 

we can achieve a convergence rate 

0 _1_ 1 4h b La(b)(m b +1) EDED 1 

r rribh b fi(l—4h b La(b)) ' mb(l- 4h b La(b)) e * ' ' 


Since k = |~log(l/e)~| > log(l/e) if and only if e k > 1/e if 

jirl 

and only if e~ k < e, we can conclude that p k — (e“ 1 ) fc = 
e~ k < e. Therefore, running mS2GD for k outer iterations 
achieves e-accuracy solution defined in ([3]). Moreover, since 
in general n e, n e, it can be concluded that 


bo 


ED 


8(l+e)n«+4n 

nAc+(7+8e)/«+4 


8 (e + l) « 29.75, 


then with the definition a (b) 


A” b A , we derive 

b{n— 1) ’ 



l<b<30 

< 


8 eft 

8 etc 


(tAhT ( e + 1 + \J 4A(5jA + (! + e ) 

((e + 1 ) + + (! + e) 2 ) = 


0(k), 


so from the total complexity can be translated to 

O ((n + if) log(l/e)). This result shows that we can reach 
efficient speedup by mini-batching as long as the mini-batch 
size is smaller than some threshold bo « 29.75, which finishes 
the proof for Corollary [3] 


Appendix C 

Proximal lazy updates for l\ and ^-Regularizers 
A. Proof of Lemma [7] 

For any s £ {1,2, ...,r} we have y s = prox ftfl (y s _i - 

rlpf 

hg) = /3(y s -i — hg), where /3 = 1/(1 + A h). Therefore, 

Vr = f3 T Vo - h (Ej=i P 3 ) 9 = P T V - t 1 - P T ] 9- 


B. Proof of Lemma [2] 

Proof. For any s £ {1, 2,..., r} and j £ {1,2,..., d,}, 

y J s = argmin -(x - y 3 s _ l + hg J ) 2 + \h\x\ 

Z 

' vl-i - (A + g j )h , if > (A + g j )h, 

vi-i + (A - g j )h, if < -(X-g j )h, 

0, otherwise, 

’ vi- 1 - M , if > M, 

y J s -1 - rn , if y J s _ 1 < TO, 

0, otherwise. 

where M = (A + g 2 )h, m = f —(A — gi)h and M — m = 
2A h > 0. Now, we will distinguish several cases based on (f: 


When 

gi > A, then M > 

TO = 

-(A - g>)h > 

0, thus 

by letting p = 

y J 

M J ’ 

we 

have 

that: if yi < to, then 

II 

1 

m < 

yi 

< M, 

then yi. = — (r 

- l)m; 

and if 

yi > M, then 






" yi — tM 




if t < p, 


II 

y 3 -pM 

- {t- 

-P) 

TO, 

if r>p&yi- 

- pM < 


.-(t-P 

- 1 )m, 


if r > p & yi 

- pM > 

-J 

yi - tM 




if t < 

P, 

"1 

min{y J - 

pM , 

m) 

- (t- 

1 

V 

P- 

When 

-A < g 3 

< A, 

then M 

= (A + g J )h > 

0, TO = 


— (A — g 3 )h < 0, thus we have that 

J max{yl — rM , 0}, if t/ J > 0, 
|min{y : ' — r?n,0}, if t/ J < 0. 


yi = 


(3) When gi < —A, then to < M = (A + g 2 )h < 0, thus by 


letting q = 


— , we have that: if y J < to, then 

my & — ’ 


»J — 


rm, 


if T<q, 


•y° T = { yi — pm — (t — q)M, if t > q & yi — qm > M, 

— (t — q — 1 )M, if t > q & yi — qm < M, 

{ yi — rm, if t < q, 

ma x{yi — qm , M} — (r — q)M, if t > q; 

if to < yi < M, then yi. = — (r — 1 )M; if yi > M, then 

y 3 r = y J - tM. 


Now, we will perform a few simplifications: Case (1). When 
yi < M, we can conclude that yi. = min {yi,m} — rm. 
Moreover, since the following equivalences hold if g J > A: 
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y j > M ^ > 1 p > 1, and y 3 < M jg < [17] 

1 •£=>■ p < 0 , the situation simplifies to 


y 3 — tM, 


if p > t. 


y 3 T = ^ min {y 3 — pM, to} — (t — p)m, if 1 < p < t, 
minjyf, m } — T rn, if p < 0, 

f V 3 ~ tM, if p > r, 

[min{y - [p] + M,?n} - (r - [p]+)m, if p < r, 

def 

where [■] + = max}-, 0}. For Case (3), when y 3 > to, we can [22] 
conclude that y J T = ma x{y 3 ,M} — tM , and in addition, the 
following equivalences hold when 9 3 < -A: 


y 3 < in <£=> ^ > 1 q > 1, 
j/7 > to -£=> ^ < 1 •£=> q < 0, 

which summarizes the situation as follows: 

! y 3 — tto, if q > t, 

max{ 2 /f — qm, M} — (r — g)M, if 1 < q < r, 
maxjt/f, M} — tM, if q < 0, 

J y 3 — rm, if q > r, 

1 max{ 2 /f - [g]+m, M} - (r - [<?]+)M, if q < r. 


[18] 

[19] 

[ 20 ] 

[ 21 ] 


[23] 

[24] 

[25] 

[26] 

[27] 

[28] 
□ [29] 
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